117 research outputs found

    EXPLoRA-web: linkage analysis of quantitative trait loci using bulk segregant analysis

    Identification of genomic regions associated with a phenotype of interest is a fundamental step toward solving questions in biology and improving industrial research. Bulk segregant analysis (BSA) combined with high-throughput sequencing is a technique to efficiently identify these genomic regions. However, distinguishing true from spuriously linked genomic regions and accurately delineating the genomic positions of the truly linked regions requires complex statistical models, which are currently implemented in software tools that are generally difficult for non-expert users to operate. To facilitate the exploration and analysis of data generated by bulk segregant analysis, we present EXPLoRA-web, a web service wrapped around our previously published algorithm EXPLoRA, which exploits linkage disequilibrium to increase the power and accuracy of quantitative trait loci identification in BSA. EXPLoRA-web provides a user-friendly interface that enables easy data upload and parallel processing of different parameter configurations. Results are provided graphically and as a BED file and/or text file, and the input is expected in widely used formats, enabling straightforward BSA data analysis. The web server is available at http://bioinformatics.intec.ugent.be/explora-web/

    An integrated platform for genome assembly, comparative genomics and management of genomic variation databases

    The use of long-read DNA sequencing technologies is producing an explosion of high-quality de novo genome assemblies. The availability of these genomes represents a major step forward for evolution, population genomics and epidemiology, among other applications. A major bottleneck for many research groups continues to be the availability of tools to build and analyze the large datasets of genomes that can be produced using these technologies. In this talk, I summarize the functionalities developed by my research group in version four of the Next Generation Sequencing Experience Platform (NGSEP) to perform a comprehensive analysis of long and short DNA sequencing reads. First, we designed new algorithms for assembly of haploid and diploid samples from long DNA sequencing reads. A minimizers table is constructed from the reads, using k-mer hash codes calculated from rankings relative to the mode of the k-mer counts distribution. Statistics collected during this process are used as features to build layout paths. For diploid samples, we integrated a reimplementation of the ReFHap algorithm to perform molecular phasing. Benchmark experiments using PacBio HiFi and Nanopore sequencing data for different species show that our solution has competitive contiguity and efficiency, as well as superior accuracy in some cases, compared to other currently used software. We also developed a functionality to perform ortholog identification and gene-based alignment of assembled genomes. Proteomes for each genome are extracted, and homology relationships are efficiently predicted by building indexes of amino acid sequences by k-mer occurrence. Then, genes are clustered in orthogroups based on the topology of the graph induced by the predicted relationships. Gene presence/absence matrices are derived from these orthogroups. If genome assemblies are provided as input, synteny relationships are identified for each pair of genomes. We also implemented algorithms to perform alignment of short and long reads to a reference genome. Based on aligned long reads, we improved the classical variants detector to detect long structural variants. Adding up these developments, NGSEP is a comprehensive tool to perform de novo and reference-based analysis of DNA sequencing reads in a wide variety of experimental settings to solve different research goals. Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 202
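
As a rough illustration of the minimizer idea mentioned in the abstract above, the Python sketch below picks, for every window of consecutive k-mers, the k-mer with the smallest hash value. This is a generic minimizer scheme and not the NGSEP implementation: the abstract describes hash codes derived from rankings relative to the mode of the k-mer count distribution, whereas this sketch assumes a plain CRC32 ordering. The function name `minimizers` and the parameters `k` and `w` are hypothetical.

```python
import zlib


def minimizers(seq, k=5, w=4):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Assumption: k-mers are ordered by CRC32 hash, which is only a stand-in
    for the count-based ranking described in the abstract.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # CRC32 is deterministic across runs, unlike Python's built-in hash() on str.
    hashes = [zlib.crc32(kmer.encode("ascii")) for kmer in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Smallest hash wins; ties are broken by the leftmost position.
        best = min(window, key=lambda i: (hashes[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


if __name__ == "__main__":
    for position, kmer in minimizers("ACGTACGTTGCAACGT"):
        print(position, kmer)
```

Because adjacent windows usually share their minimizer, the selected set is much smaller than the full list of k-mers, while every window is still guaranteed to contain at least one selected k-mer.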

    Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data

    BACKGROUND: Massively parallel transcriptome sequencing (RNA-Seq) is becoming the method of choice for studying functional effects of genetic variability and establishing causal relationships between genetic variants and disease. However, RNA-Seq poses new technical and computational challenges compared to genome sequencing. In particular, mapping transcriptome reads onto the genome is more challenging than mapping genomic reads due to splicing. Furthermore, detection and genotyping of single nucleotide variants (SNVs) requires statistical models that are robust to variability in read coverage due to unequal transcript expression levels. RESULTS: In this paper we present a strategy to more reliably map transcriptome reads by taking advantage of the availability of both the genome reference sequence and transcript databases such as CCDS. We also present a novel Bayesian model for SNV discovery and genotyping based on quality scores. CONCLUSIONS: Experimental results on RNA-Seq data generated from blood cell tissue of three HapMap individuals show that our methods yield increased accuracy compared to several widely used methods. The open source code implementing our methods, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/NGSTools/
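
To make the idea of quality-score-based Bayesian genotyping more concrete, here is a minimal Python sketch of a textbook diploid genotype posterior at a single site. It is not the model from the paper above, whose details the abstract does not give. The error model (a miscalled base is equally likely to be any of the three other bases), the default priors, and the function name `genotype_posteriors` are all assumptions made for illustration.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, priors=(0.998, 0.001, 0.001)):
    """Posterior probability of ref/ref, ref/alt and alt/alt at one site.

    bases: observed read bases at the site, e.g. "AAGG"
    quals: Phred quality scores, one per base (assumed > 0)
    priors: assumed prior probabilities for the three genotypes
    """
    genotypes = [(ref, ref), (ref, alt), (alt, alt)]
    log_post = []
    for (a1, a2), prior in zip(genotypes, priors):
        total = math.log(prior)
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10)  # Phred score -> error probability
            p1 = 1 - err if base == a1 else err / 3
            p2 = 1 - err if base == a2 else err / 3
            # Each read samples one of the two alleles with equal probability.
            total += math.log(0.5 * p1 + 0.5 * p2)
        log_post.append(total)
    # Normalize in log space to avoid underflow at high coverage.
    top = max(log_post)
    weights = [math.exp(value - top) for value in log_post]
    norm = sum(weights)
    return {a1 + a2: weight / norm for (a1, a2), weight in zip(genotypes, weights)}


if __name__ == "__main__":
    print(genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G"))
```

Note that the equal-sampling assumption in this sketch is exactly what unequal allele expression in RNA-Seq can violate, which is one reason the abstract calls for models robust to coverage variability.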

    Improved linkage analysis of Quantitative Trait Loci using bulk segregants unveils a novel determinant of high ethanol tolerance in yeast

    Background: Bulk segregant analysis (BSA) coupled to high-throughput sequencing is a powerful method to map genomic regions related to phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, and their DNA is extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors. Results: To increase the power of the BSA technology and obtain a better distinction between spuriously and truly linked regions, we developed EXPLoRA (EXtraction of over-rePresented aLleles in BSA), an algorithm for BSA data analysis that explicitly models the dependency between neighboring marker sites by exploiting the properties of linkage disequilibrium through a Hidden Markov Model (HMM). Reanalyzing a BSA dataset for high ethanol tolerance in yeast allowed reliable identification of QTLs linked to this phenotype that could not be identified with statistical significance in the original study. Experimental validation of one of the least pronounced linked regions, by identifying its causative gene VPS70, confirmed the potential of our method. Conclusions: EXPLoRA performs at least as well as the state of the art and is robust even at low signal-to-noise ratios, i.e. when the true linkage signal is diluted by sampling or screening errors, or when few segregants are available.
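
The idea of modeling dependency between neighboring markers with an HMM can be sketched with a toy two-state model (unlinked vs. linked) decoded with the Viterbi algorithm. The sketch below is not EXPLoRA and makes several assumptions that do not come from the abstract: binomial emissions on the superior-parent allele count, the allele frequencies 0.5 and 0.9 for the two states, and a fixed probability `stay` of remaining in the same state between adjacent markers. The function name is hypothetical.

```python
import math


def viterbi_linked_regions(counts, p_unlinked=0.5, p_linked=0.9, stay=0.95):
    """Label each marker as unlinked (0) or linked (1).

    counts: list of (superior_allele_reads, total_reads), in genomic order.
    All numeric parameters are illustrative assumptions.
    """
    if not counts:
        return []
    probs = (p_unlinked, p_linked)
    log_trans = [
        [math.log(stay), math.log(1 - stay)],
        [math.log(1 - stay), math.log(stay)],
    ]

    def log_emit(state, sup, total):
        # Binomial log-likelihood without the binomial coefficient,
        # which is identical for both states and cancels in the comparison.
        p = probs[state]
        return sup * math.log(p) + (total - sup) * math.log(1 - p)

    score = [math.log(0.5) + log_emit(s, *counts[0]) for s in (0, 1)]
    back = []
    for sup, total in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            prev = max((0, 1), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_emit(s, sup, total))
        score = new_score
        back.append(pointers)

    # Trace back the most probable state path from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    pool = [(10, 20), (11, 20), (9, 20), (18, 20), (19, 20),
            (17, 20), (18, 20), (10, 20), (11, 20)]
    print(viterbi_linked_regions(pool))
```

The transition penalty is what lets neighboring markers support each other: a single marker with a noisy allele frequency is less likely to flip the state than a run of consistent markers.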

    Linkage disequilibrium based genotype calling from low-coverage shotgun sequencing reads

    Background: Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping. Results: In this paper we introduce a new multi-locus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as HapMap or the 1000 Genomes Project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integration of LD information results in genotype calling accuracy comparable to that of microarray platforms from low-coverage sequencing data. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/. Conclusions: Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, rendering low-coverage sequencing a viable alternative to microarrays for conducting large-scale genome-wide association studies.

    Prototypes for daytime lighting, water heating and gas supply made from recycled material for a low-cost home

    Research project. A document is produced that describes the technical procedures and materials needed to build three prototypes for daytime lighting, water heating and gas supply from recycled materials. Undergraduate. Ingeniero Civi

    Combining image analysis, genome wide association studies and different field trials to reveal stable genetic regions related to panicle architecture and the number of spikelets per panicle in rice

    Number of spikelets per panicle (NSP) is a key trait to increase yield potential in rice (O. sativa). The architecture of the rice inflorescence, which is mainly determined by the length and number of primary (PBL and PBN) and secondary (SBL and SBN) branches, can influence NSP. Although several genes controlling panicle architecture and NSP in rice have been identified, there is little evidence of (i) the genetic control of panicle architecture and NSP in different environments and (ii) the presence of stable genetic associations with panicle architecture across environments. This study combines image phenotyping of 225 accessions belonging to a genetic diversity array of indica rice grown under irrigated field conditions in two different environments and Genome Wide Association Studies (GWAS) based on the genotyping of the diversity panel, providing 83,374 SNPs. Accessions sown under direct seeding in one environment had reduced Panicle Length (PL), NSP, PBN, PBL, SBN, and SBL compared to those established under transplanting in the second environment. Across environments, NSP was significantly and positively correlated with PBN, SBN and PBL. However, the length of branches (PBL and SBL) was not significantly correlated with variables related to number of branches (PBN and SBN), suggesting independent genetic control. Twenty-three GWAS sites were detected with P ≤ 1.0E−04 and 27 GWAS sites with P ≤ 5.9E−04. We found 17 GWAS sites related to NSP, 10 for PBN and 11 for SBN, 7 for PBL and 11 for SBL. This study revealed new regions related to NSP, but only three associations were related to both branching number (PBN and SBN) and NSP. Two GWAS sites associated with SBL and SBN were stable across contrasting environments and were not related to genes previously reported. The new regions reported in this study can help improve NSP in rice for both direct seeded and transplanted conditions. The integrated approach of high-throughput phenotyping, multi-environment field trials and GWAS has the potential to dissect complex traits, such as NSP, into less complex traits and to match single nucleotide polymorphisms with relevant function under different environments, offering a potential use for molecular breeding.

    Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques

    Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics

    Implementation and modeling of a femtosecond laser-activated streak camera

    (8 June 2017) A laser-activated streak camera was built to measure the duration of femtosecond electron pulses. The streak velocity of the device is 1.89 mrad/ps, which corresponds to a sensitivity of 34.9 fs/pixel. The streak camera also measures changes in the relative time of arrival between the laser and electron pulses with a resolution of 70 fs RMS. A full circuit analysis of the structure is presented to describe the streaking field and the general behavior of the device. We have developed a general mathematical model to analyze the streaked images. The model provides an accurate method to extract the pulse duration based on the changes of the electron beam profile when the streaking field is applied.
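
The two figures quoted in the abstract above are linked by simple unit arithmetic, which the Python sketch below spells out. The angular size of one pixel is not stated in the abstract; it is inferred here by multiplying the two quoted values, so treat it as a derived estimate. The function names are hypothetical, and this linear conversion is not a substitute for the image model described in the paper.

```python
STREAK_VELOCITY_MRAD_PER_PS = 1.89  # quoted in the abstract
SENSITIVITY_FS_PER_PIXEL = 34.9     # quoted in the abstract


def pixel_angle_urad(velocity_mrad_per_ps=STREAK_VELOCITY_MRAD_PER_PS,
                     fs_per_pixel=SENSITIVITY_FS_PER_PIXEL):
    """Angular size of one pixel implied by the two quoted figures, in microradians."""
    ps_per_pixel = fs_per_pixel / 1000.0
    mrad_per_pixel = velocity_mrad_per_ps * ps_per_pixel
    return mrad_per_pixel * 1000.0


def shift_to_time_fs(shift_pixels, fs_per_pixel=SENSITIVITY_FS_PER_PIXEL):
    """Convert a streak displacement in pixels to a time difference in femtoseconds."""
    return shift_pixels * fs_per_pixel


if __name__ == "__main__":
    print(f"implied pixel angle: {pixel_angle_urad():.2f} urad")
    print(f"2-pixel shift: {shift_to_time_fs(2):.1f} fs")
```

Under these figures, the quoted 70 fs RMS timing resolution corresponds to a displacement of roughly two pixels.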